Predicting Alpha Thalassemia Phenotype using Clinical Measures
Ashley Williams
Alpha thalassemia is a prevalent genetic disorder with a wide spectrum of clinical severity, ranging from asymptomatic carrier states to fatal severe anemia. Genotype-phenotype correlations are generally strong but variable, necessitating reliable prognostic tools for clinical management and improved disease prevention. This project aimed to develop and validate a logistic regression model to predict the clinical phenotype (either silent carrier or alpha trait status) of alpha thalassemia patients using clinical variables.
Results indicate that hemoglobin concentration, red blood cell count, and lymphocyte percentage, in combination, predict phenotypic severity. The final model achieved 81.4% accuracy on held-out test data (AUC = 0.878) and allows for easy interpretation, demonstrating its potential use as a clinical decision support tool. The resulting model provides a cost-effective method to aid in alpha thalassemia assessment and prevention, with particular application in low-resource settings where advanced testing is inaccessible.
Alpha thalassemia is an inherited blood disorder causing the body to produce an insufficient amount of hemoglobin, thus leading to anemia.
hb, Hemoglobin concentration in grams per decilitre - g/dL
rbc, Red blood cell count in 10^12/L
lymph, Percentage of white blood cells that are lymphocytes
neut, Percentage of white blood cells that are neutrophils
plt, Total platelet count in 10^6/L
phenotype, Phenotype of the patient, either Silent Carrier or Alpha Trait
I obtained my data for this project from Kaggle.
There is one missing value present in this data set. It was missing for the variable mch, mean corpuscular hemoglobin. However, I am not using this variable in my model so this is not a concern.
I next converted the categorical variables sex and phenotype into factors as described below, such that they can be used in my logistic regression approach.
For this project, I employ a binary logistic regression approach.
Logistic regression is a method used to predict the probability of a binary outcome, that is, an outcome with two mutually exclusive categories.
Logistic regression analyzes the relationship between the target and predictor variables by utilizing a logistic function to model the probability of an event occurring, rather than a continuous value as seen in linear regression.
To complete my logistic regression approach, I utilize several R packages, including caret, nnet, pROC, and pscl.
Training data:
silent_carrier alpha_trait
74 30
Testing data:
silent_carrier alpha_trait
31 12
The final model formula is phenotype ~ hb + rbc + lymph.
This model was selected after testing several full and reduced models, as it ultimately had the best performance on key logistic regression outputs.
A binary logistic regression model was fitted predicting phenotype from hemoglobin concentration (hb), red blood cell count (rbc), and the percentage of white blood cells that are lymphocytes (lymph).
Call:
glm(formula = formula_logit, family = binomial, data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.57589 2.24398 1.594 0.111038
hb -0.69551 0.20504 -3.392 0.000694 ***
rbc 0.95577 0.46345 2.062 0.039182 *
lymph -0.03237 0.01848 -1.752 0.079794 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.96 on 103 degrees of freedom
Residual deviance: 109.47 on 100 degrees of freedom
AIC: 117.47
Number of Fisher Scoring iterations: 4
Analysis of Deviance Table
Model 1: phenotype ~ 1
Model 2: phenotype ~ hb + rbc + lymph
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 103 124.96
2 100 109.47 3 15.487 0.001444 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test (LRT) compares the logistic regression model to a null model containing only an intercept. The residual deviance decreases from 124.96 to 109.47 when the predictors are added, producing a test statistic of 15.487 on 3 degrees of freedom (p = 0.0014). This small p-value indicates that the predictors significantly improve model fit compared to the null model, so the included clinical variables carry explanatory power for predicting the phenotype of Alpha Thalassemia.
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML r2CU
-54.7364213 -62.4799152 15.4869877 0.1239357 0.1383562 0.1978586
Pseudo-R2 values provide additional measures of model fit for logistic regression. McFadden’s pseudo-R2 was 0.1239, which indicates a modest fit. The maximum-likelihood R2 (r2ML = 0.1384, Cox & Snell) similarly suggests improvement over the null model. Nagelkerke’s pseudo-R2 (r2CU), which rescales the Cox & Snell value so that its maximum is 1, was 0.1979. Overall, these values indicate that the logistic regression model provides a modest fit to the data that is better for predicting phenotype than the null model.
(Intercept) hb rbc lymph
35.7263857 0.4988216 2.6006603 0.9681505
For each additional 1 g/dL of hemoglobin concentration, the odds of the patient presenting the alpha trait phenotype of Alpha thalassemia decrease by about 50.12%, holding all other variables constant. Here, 50.12% comes from 1 - 0.4988 = 0.5012.
For each 1x10^12 cells/L increase in red blood cell count, the odds of the patient presenting the alpha trait phenotype of Alpha thalassemia increase by a factor of about 2.6, holding all other predictors constant.
For each additional 1% of the white blood cell population that is lymphocytes, the odds of the patient presenting the alpha trait phenotype of Alpha thalassemia decrease by about 3.18%, holding all other variables constant. Here, 3.18% comes from 1 - 0.9682 = 0.0318.
Confusion Matrix and Statistics
Reference
Prediction silent_carrier alpha_trait
silent_carrier 29 6
alpha_trait 2 6
Accuracy : 0.814
95% CI : (0.666, 0.9161)
No Information Rate : 0.7209
P-Value [Acc > NIR] : 0.1144
Kappa : 0.485
Mcnemar's Test P-Value : 0.2888
Sensitivity : 0.5000
Specificity : 0.9355
Pos Pred Value : 0.7500
Neg Pred Value : 0.8286
Prevalence : 0.2791
Detection Rate : 0.1395
Detection Prevalence : 0.1860
Balanced Accuracy : 0.7177
'Positive' Class : alpha_trait
Using a 0.5 probability cutoff on the test data, the model achieved an accuracy of 81.4%, with 50% sensitivity and 93.55% specificity.
The ROC curve yielded an AUC of 0.878, indicating excellent discrimination between patients with the silent carrier phenotype and those with the alpha trait phenotype.
This logistic regression model demonstrates that a small set of clinical predictors provides explanatory and predictive power for assessing phenotype of Alpha Thalassemia patients, while maintaining easy interpretability for use in clinical screenings and application in disease prevention.
---
title: "AT Analysis"
output:
flexdashboard::flex_dashboard:
theme:
version: 4
bootswatch: default
navbar-bg: "#b53533"
orientation: columns
vertical_layout: fill
source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
pacman::p_load(caret, nnet, pROC, pscl, tidyverse, DT)
data <- read.csv("twoalphas.csv")
data$phenotype <- ifelse(data$phenotype == "alpha trait", 1, 0)
#alpha trait = 1
#female = 1
data$sex <- ifelse(data$sex == "female", 1, 0)
data2 <- data %>%
mutate(
phenotype = factor(phenotype, levels = c(0, 1),
labels = c("silent_carrier", "alpha_trait")),
sex = factor(sex, levels = c(0, 1),
labels = c("Male", "Female")))
```
Title
===
Column {data-width=450}
---
### <b><span Style="color:#4f0c0b">Title</span></b>
<font size=8><b><span Style="color:#b53533">Predicting Alpha Thalassemia Phenotype using Clinical Measures</span></b></font>
<font size=6><b><span Style="color:#d15c5a">Ashley Williams</span></b></font>
Column {data-width=550}
---
### <b><span Style="color:#4f0c0b">Abstract</span></b>
Alpha thalassemia is a prevalent genetic disorder with a wide spectrum of clinical severity, ranging from asymptomatic carrier states to fatal severe anemia. Genotype-phenotype correlations are generally strong but variable, necessitating reliable prognostic tools for clinical management and improved disease prevention. This project aimed to develop and validate a logistic regression model to predict the clinical phenotype (either silent carrier or alpha trait status) of alpha thalassemia patients using clinical variables.
Results indicate that hemoglobin concentration, red blood cell count, and lymphocyte percentage, in combination, predict phenotypic severity. The final model achieved 81.4% accuracy on held-out test data (AUC = 0.878) and allows for easy interpretation, demonstrating its potential use as a clinical decision support tool. The resulting model provides a cost-effective method to aid in alpha thalassemia assessment and prevention, with particular application in low-resource settings where advanced testing is inaccessible.
Background
===
Column {.tabset data-width=500}
-----------------------------------------------------------------------
### <font size=2.8><span Style="color:#4f0c0b">Background</span></font>
<span Style="color:#b53533">Alpha thalassemia is an inherited blood disorder causing the body to produce an insufficient amount of hemoglobin, thus leading to anemia.</span>
- <span Style="color:#d15c5a">Alpha thalassemia occurs when 1 or more of the 4 total alpha-globin genes (2 inherited from each parent), which contribute to the synthesis of hemoglobin molecules, are mutated or deleted.</span>
- <span Style="color:#b53533">There are multiple types of alpha thalassemia with a range of severities. In this project, I focus on the following:
- <b>Alpha thalassemia silent carrier:</b> One alpha-globin gene is affected, the other 3 are wildtype. Blood tests are often normal, but their red blood cells may be smaller than normal. Being a silent carrier means you don’t have signs of the disease, but you can pass the damaged gene on to progeny. This is confirmed by DNA tests.
- <b>Alpha thalassemia trait carrier:</b> Two genes are affected. Patient likely to have mild anemia.</span>
- <span Style="color:#d15c5a">Having 3 affected genes leads to Hemoglobin H disease, where the patient has moderate to severe anemia. Having all 4 affected genes causes severe anemia, where most cases lead to prenatal death.</span>
- <span Style="color:#b53533">There is no cure for Alpha thalassemia. Thus, effective screening to detect Thalassemia carriers is vital to prevention. There are many challenges to an effective screening program, especially in low-resource settings. Considering alpha-thalassemia, genetic testing is needed for a confirmatory diagnosis of a carrier, which is expensive and not widely available. Thus follows the importance of building predictive models that can act as decision-support tools, because they are easy to deploy and use in low-resource settings where other options are limited.</span>
### <font size=2.8><span Style="color:#4f0c0b">Research Questions</span></font>
### <font size=2.8><span Style="color:#4f0c0b">Variables of Interest</span></font>
- <span Style="color:#b53533">This dataset contains 16 total variables; the following were considered in this project:</span>
- `hb`, Hemoglobin concentration in grams per decilitre - g/dL
- `rbc`, Red blood cell count in 10^12/L
- `lymph`, Percentage of white blood cells that are lymphocytes
- `neut`, Percentage of white blood cells that are neutrophils
- `plt`, Total platelet count in 10^6/L
- `phenotype`, Phenotype of the patient, either Silent Carrier or Alpha Trait
### <font size=2.8><span Style="color:#4f0c0b">Source & Cleaning</span></font>
<span Style="color:#b53533">I obtained my data for this project from [Kaggle](https://www.kaggle.com/datasets/letslive/alpha-thalassemia-dataset?select=twoalphas.csv).</span>
- <span Style="color:#d15c5a">About the dataset</span>
- <span Style="color:#b53533">This dataset is from a database of 288 cases from the Human Genetics Unit (HGU) of the Faculty of Medicine, Colombo, Sri Lanka.</span>
- <span Style="color:#d15c5a">The data used in this project (n=147) was collected from Alpha thalassemia carrier children and their family members screened, from 2016 to 2020.</span>
- <span Style="color:#b53533">Data Cleaning</span>
- <span Style="color:#d15c5a">There is one missing value present in this data set. It was missing for the variable `mch`, mean corpuscular hemoglobin. However, I am not using this variable in my model so this is not a concern.</span>
- <span Style="color:#b53533">I next converted the categorical variables `sex` and `phenotype` into factors as described below, such that they can be used in my logistic regression approach.</span>
- <span Style="color:#d15c5a">Sex, where <b>Male = 0</b> and <b>Female = 1</b>
- Phenotype, where <b>Silent Carrier = 0</b> and <b>Alpha Trait = 1</b></span>
- <span Style="color:#d15c5a">Finally, I checked the distributions of all the variables for outliers. While there were some, they were not out of the realm of biological possibility, so all 147 observations in the data set used here were included in this study.</span>
Column {.tabset data-width=500}
-----------------------------------------------------------------------
### <span Style="color:#4f0c0b">Data Table</span>
```{r}
datatable(data[1:50,], rownames=FALSE)
```
### <span Style="color:#4f0c0b">Data Cleaning Intro</span>
```{r}
library(DataExplorer)
plot_intro(data)
```
### <span Style="color:#4f0c0b">Data Cleaning Histogram</span>
```{r}
plot_histogram(data)
```
Methods and EDA
===
Column {.tabset data-width=500}
---
### EDA Analysis
- going to discuss changes in median/spread of data between the silent carrier and alpha trait phenotype for each variable. Maybe exclude some that I don't consider in my model?
### Methods
For this project, I employ a binary logistic regression approach.
- Logistic regression is a method used to predict the probability of a binary outcome, that is, an outcome with two mutually exclusive categories.
- In this case, the model predicts the probability of a person being either a silent carrier or possessing the alpha trait phenotype.
- Logistic regression analyzes the relationship between the target and predictor variables by utilizing a logistic function to model the probability of an event occurring, rather than a continuous value as seen in linear regression.
- To complete my logistic regression approach, I utilize several R packages, including caret, nnet, pROC, and pscl.
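
As a minimal sketch of the logistic function described above, the block below turns a linear predictor into a probability, using the coefficient estimates reported in the model summary for the training data. The patient values (`hb = 12`, `rbc = 5`, `lymph = 35`) are hypothetical and chosen only for illustration, and the block is shown for reading rather than evaluated as part of the dashboard.

```r
# Coefficient estimates as reported in the model summary (training data)
beta <- c(intercept = 3.57589, hb = -0.69551, rbc = 0.95577, lymph = -0.03237)

# Hypothetical patient (illustrative values, not taken from the dataset)
patient <- c(hb = 12, rbc = 5, lymph = 35)

# Linear predictor on the log-odds scale
eta <- beta[["intercept"]] + sum(beta[names(patient)] * patient)

# The logistic function maps log-odds to a probability between 0 and 1
p_alpha_trait <- plogis(eta)   # same as 1 / (1 + exp(-eta))
p_alpha_trait
```

With a fitted model object, `predict(..., type = "response")` performs the same calculation for every row of a data frame.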
Column {.tabset data-width=500}
---
### Phenotype
```{r}
# Label each bar with its computed count instead of hard-coding the numbers
ggplot(data2, aes(x=phenotype))+geom_bar(fill="#d15c5a", color="black")+geom_text(stat="count", aes(label=after_stat(count)), vjust=-0.5)+labs(title="Distribution of Phenotype", x="Phenotype", y="Count")
```
### Hb
```{r}
ggplot(data2, aes(x=phenotype, y=hb))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of Hemoglobin concentration by phenotype", x="phenotype", y="hb (g/dL)")
```
### Pcv
```{r}
ggplot(data2, aes(x=phenotype, y=pcv))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of PCV/hematocrit % by phenotype", x="phenotype", y="pcv/hematocrit %")
```
### rbc
```{r}
ggplot(data2, aes(x=phenotype, y=rbc))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of RBC Count by phenotype", x="phenotype", y="rbc (10^12/L)")
```
### lymph
```{r}
ggplot(data2, aes(x=phenotype, y=lymph))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of lymph by phenotype", x="phenotype", y="lymph")
```
### plt
```{r}
ggplot(data2, aes(x=phenotype, y=plt))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of plt by phenotype", x="phenotype", y="plt")
```
### mch
```{r}
ggplot(data2, aes(x=phenotype, y=mch))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of mch by phenotype", x="phenotype", y="mch")
```
### mchc
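```{r}
# Assumes the dataset has a column named mchc, following the pattern of the other tabs
ggplot(data2, aes(x=phenotype, y=mchc))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of mchc by phenotype", x="phenotype", y="mchc")
```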
### rdw
```{r}
ggplot(data2, aes(x=phenotype, y=rdw))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of rdw by phenotype", x="phenotype", y="rdw")
```
Model Performance
===
Column {.tabset data-width=500}
---
### Set up
- To assess how well the logistic regression model generalizes, the data need to be partitioned into two groups.
- (1) Training data, used to estimate model parameters.
- (2) Test data, to assess how well the model works on new, unseen data.
- 70% of the data was used for training data, and 30% was reserved for test data.
```{r}
library(caret)
set.seed(11)
idx <- createDataPartition(data2$phenotype, p = 0.7, list = FALSE)
train <- data2[idx, ]
test <- data2[-idx, ]
```
Training data:
```{r}
table(train$phenotype)
```
Testing data:
```{r}
table(test$phenotype)
```
### Model
The final model formula is `phenotype ~ hb + rbc + lymph`.
This model was selected after testing several full and reduced models, as it ultimately had the best performance on key logistic regression outputs.
A binary logistic regression model was fitted predicting `phenotype` from hemoglobin concentration `hb`, red blood cell count `rbc`, and the percentage of white blood cells that are lymphocytes `lymph`.
```{r}
formula_logit <- phenotype ~ hb + rbc + lymph
logit_model <- glm(formula_logit, data = train, family = binomial)
summary(logit_model)
```
Column {.tabset data-width=500}
---
### Goodness of fit
```{r}
null_model <- glm(phenotype ~ 1, data = train, family = binomial)
anova(null_model, logit_model, test = "Chisq")
```
The likelihood ratio test (LRT) compares the logistic regression model to a null model containing only an intercept. The residual deviance decreases from 124.96 to 109.47 when the predictors are added, producing a test statistic of 15.487 on 3 degrees of freedom (p = 0.0014). This small p-value indicates that the predictors significantly improve model fit compared to the null model, so the included clinical variables carry explanatory power for predicting the phenotype of Alpha Thalassemia.
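
The likelihood ratio test can also be reproduced by hand from the two deviances. The sketch below uses the rounded deviance values reported in the analysis of deviance table, so the statistic and p-value may differ from the `anova()` output in the last digits; it is shown for reading rather than evaluated as part of the dashboard.

```r
# Deviances as reported in the analysis of deviance table
null_dev  <- 124.96   # intercept-only model, 103 residual df
resid_dev <- 109.47   # model with hb, rbc and lymph, 100 residual df

lrt_stat <- null_dev - resid_dev   # drop in deviance
lrt_df   <- 103 - 100              # number of added predictors

# Upper-tail chi-squared probability of a drop at least this large
p_value <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

c(statistic = lrt_stat, df = lrt_df, p = p_value)
```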
```{r}
pR2(logit_model)
```
Pseudo-R2 values provide additional measures of model fit for logistic regression.
McFadden’s pseudo-R2 was 0.1239, which indicates a modest fit.
The maximum-likelihood R2 (r2ML = 0.1384, Cox & Snell) similarly suggests improvement over the null model.
Nagelkerke’s pseudo-R2 (r2CU), which rescales the Cox & Snell value so that its maximum is 1, was 0.1979. Overall, these values indicate that the logistic regression model provides a modest fit to the data that is better for predicting phenotype than the null model.
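
To show where the three pseudo-R2 values come from, the sketch below recomputes them from the two log-likelihoods reported by `pR2()`, using the standard McFadden, Cox & Snell, and Nagelkerke formulas. The training sample size of 104 is taken from the training table (74 + 30); the block is shown for reading rather than evaluated as part of the dashboard.

```r
# Log-likelihoods as reported by pscl::pR2() for the training data
ll_model <- -54.7364213
ll_null  <- -62.4799152
n        <- 104   # training observations (74 + 30)

g2         <- 2 * (ll_model - ll_null)                 # likelihood ratio statistic
mcfadden   <- 1 - ll_model / ll_null
cox_snell  <- 1 - exp(-g2 / n)
nagelkerke <- cox_snell / (1 - exp(2 * ll_null / n))   # Cox & Snell divided by its maximum

round(c(G2 = g2, McFadden = mcfadden, r2ML = cox_snell, r2CU = nagelkerke), 4)
```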
### Key effects
```{r}
or <- exp(coef(logit_model))
or
```
- For each additional 1 g/dL of hemoglobin concentration, the odds of the patient presenting the alpha trait phenotype of Alpha thalassemia decrease by about 50.12%, holding all other variables constant. Here, 50.12% comes from 1 - 0.4988 = 0.5012.
- For each 1x10^12 cells/L increase in red blood cell count, the odds of the patient presenting the alpha trait phenotype of Alpha thalassemia increase by a factor of about 2.6, holding all other predictors constant.
- For each additional 1% of the white blood cell population that is lymphocytes, the odds of the patient presenting the alpha trait phenotype of Alpha thalassemia decrease by about 3.18%, holding all other variables constant. Here, 3.18% comes from 1 - 0.9682 = 0.0318.
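
The percentages in these interpretations follow directly from the odds ratios. The sketch below repeats the arithmetic with the odds ratios printed above; the 2 g/dL example at the end is a hypothetical illustration of how odds ratios compound, and the block is shown for reading rather than evaluated as part of the dashboard.

```r
# Odds ratios as printed by exp(coef(logit_model))
or <- c(hb = 0.4988216, rbc = 2.6006603, lymph = 0.9681505)

# Percent change in the odds for a one-unit increase in each predictor
pct_change <- (or - 1) * 100
round(pct_change, 2)

# Odds ratios multiply across units: a hypothetical 2 g/dL rise in hb
or_hb_2 <- or[["hb"]]^2
or_hb_2
```

A negative percent change means the odds of the alpha trait phenotype fall as the predictor rises; a positive value means they rise.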
### CM
```{r}
test_prob <- predict(logit_model, newdata = test, type = "response")
test_pred <- ifelse(test_prob >= 0.5, "alpha_trait", "silent_carrier") %>%
factor(levels = levels(test$phenotype))
cm <- confusionMatrix(test_pred, test$phenotype, positive = "alpha_trait")
cm
```
Using a 0.5 probability cutoff on the test data, the model achieved an accuracy of 81.4%, with 50% sensitivity and 93.55% specificity.
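
The summary statistics can be traced back to the four cells of the confusion matrix. The sketch below recomputes them from the test-set counts, with `alpha_trait` as the positive class; it is shown for reading rather than evaluated as part of the dashboard.

```r
# Counts from the test-set confusion matrix (positive class: alpha_trait)
tp <- 6    # predicted alpha_trait, truly alpha_trait
fn <- 6    # predicted silent_carrier, truly alpha_trait
fp <- 2    # predicted alpha_trait, truly silent_carrier
tn <- 29   # predicted silent_carrier, truly silent_carrier

sensitivity <- tp / (tp + fn)   # share of true alpha trait cases detected
specificity <- tn / (tn + fp)   # share of true silent carriers detected

metrics <- c(
  accuracy          = (tp + tn) / (tp + tn + fp + fn),
  sensitivity       = sensitivity,
  specificity       = specificity,
  ppv               = tp / (tp + fp),
  npv               = tn / (tn + fn),
  balanced_accuracy = (sensitivity + specificity) / 2
)
round(metrics, 4)
```

At the 0.5 cutoff the model misses half of the alpha trait cases in the test set, which is why balanced accuracy is lower than overall accuracy.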
### ROC/AUC
```{r}
roc_obj <- roc(response = test$phenotype,
predictor = test_prob,
levels = c("silent_carrier", "alpha_trait"),
direction = "<")
plot(roc_obj,
print.auc = TRUE,
legacy.axes = TRUE,
main = "ROC Curve for Alpha Thalassemia Model")
```
The ROC curve yielded an AUC of 0.878, indicating excellent discrimination between patients with the silent carrier phenotype and those with the alpha trait phenotype.
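
One way to read the AUC is as the probability that a randomly chosen alpha trait patient receives a higher predicted probability than a randomly chosen silent carrier. The sketch below computes that quantity directly on a small set of hypothetical scores (not the thalassemia data), which equals the area under the empirical ROC curve; it is shown for reading rather than evaluated as part of the dashboard.

```r
# Hypothetical predicted probabilities (not from the thalassemia data)
pos <- c(0.9, 0.6, 0.4)        # cases that truly belong to the positive class
neg <- c(0.5, 0.3, 0.2, 0.1)   # cases that truly belong to the negative class

# Compare every positive score with every negative score
wins <- outer(pos, neg, ">")
ties <- outer(pos, neg, "==")

# AUC = share of pairs where the positive case scores higher, ties count half
auc_manual <- mean(wins + 0.5 * ties)
auc_manual
```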
### Conclusion
This logistic regression model demonstrates that a small set of clinical predictors provides explanatory and predictive power for assessing phenotype of Alpha Thalassemia patients, while maintaining easy interpretability for use in clinical screenings and application in disease prevention.
Discussion & Limitations
===
Column {.tabset data-width=1000}
---
### Conclusions
### Limitations
- small sample size
### Future Directions
- apply this to global demographic
- develop approaches for additional phenotypes of Alpha thalassemia
About the Author
===
Column {data-width=500}
---
### Background
My name is Ashley Williams and I am an undergraduate student attending the University of Dayton. I am majoring in Biology and I am minoring in Chemistry, Data Analytics, Neuroscience, and Research in the Biological Sciences. My anticipated graduation is in May of 2027.
I am an undergraduate researcher and have co-authorship of two peer-reviewed scientific papers, one from [2022](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010653) and one from [this year](https://academic.oup.com/mbe/article/42/9/msaf213/8248050)! I conduct my research in the [Williams Lab](https://thetomwilliamslab.com/), where I specifically study the regulation of the <i>Drosophila melanogaster pale</i> gene, and its origin during the evolution of a dimorphic pigmentation trait. I have been heavily involved in scientific research since 2021, and I have also presented my research on numerous occasions including twice at the University of Dayton's <span Style="color:#cf311e">Stander Symposium</span>, at <span Style="color:#267c28">the Society for Developmental Biology's 83rd Annual Meeting</span>, and at <span Style="color:#0066b6">the American Society for Biochemistry and Molecular Biology's conference, "Evolution and core processes in gene regulation"</span>.
I am interested in pursuing a Ph.D. in the field of genetics after my graduation, and continuing my career in academia and biological research.
Column {data-width=500}
---
### Presenting
```{r, fig.width=6, echo=FALSE, fig.align='right'}
knitr::include_graphics("IMG_6751.jpeg")
```